All .csv data used in this project is from https://data.worldbank.org/
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import math
To begin, I am curious whether a nation's level of wealth is a good indicator of its income distribution.
Are wealthier nations more likely to have greater economic disparity? This might make sense, as wealthier nations may be home to larger corporations that concentrate wealth and create billionaires.
Or do less wealthy nations see the greatest income inequality? This might also make sense, as developing economies may offer greater investment opportunities to those who already hold wealth.
To approach this question, I will use GDP Per Capita as a nation's measure of wealth and the Gini Index as its measure of income inequality.
The World Bank .csv data for each indicator contains one row per nation with one column per year. There are also irrelevant columns for Country Code, Indicator Name, and Indicator Code, plus an empty trailing column at the end.
Additionally, since the World Bank does not record updated values for every indicator every year, there are many NaN values and missing year entries for each nation.
#first four rows of the csv file include information about the last update date and source
gdp_df = pd.read_csv('world_bank_csv/gdp_per_capita.csv', skiprows = 4)
#drop unnecessary columns/information
gdp_df = gdp_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
#create new column tracking number of years of missing data per nation
gdp_df['Missing Years'] = gdp_df.isnull().sum(axis=1)
gdp_df.head()
| | Country Name | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Missing Years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 24712.493263 | 26441.619936 | 26893.011506 | 28396.908423 | 28452.170615 | 29350.805019 | 30253.279358 | NaN | NaN | 28 |
| 1 | Africa Eastern and Southern | 147.450369 | 146.853701 | 156.017929 | 182.044085 | 162.169577 | 180.017301 | 190.636220 | 192.126267 | 203.783404 | ... | 1667.992618 | 1648.867514 | 1654.314865 | 1503.859334 | 1401.281053 | 1536.206783 | 1530.161917 | 1481.425292 | 1326.663658 | 0 |
| 2 | Afghanistan | 59.773234 | 59.860900 | 58.458009 | 78.706429 | 82.095307 | 101.108325 | 137.594298 | 160.898434 | 129.108311 | ... | 641.871438 | 637.165464 | 613.856505 | 578.466353 | 509.220100 | 519.888913 | 493.756581 | 507.103392 | 508.808409 | 20 |
| 3 | Africa Western and Central | 107.963779 | 113.114697 | 118.865837 | 123.478967 | 131.892939 | 138.566819 | 144.368395 | 128.620051 | 129.678996 | ... | 1936.390962 | 2123.392433 | 2166.743309 | 1886.248158 | 1666.422406 | 1606.978332 | 1695.959215 | 1772.339155 | 1714.426800 | 0 |
| 4 | Angola | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5100.097027 | 5254.881126 | 5408.411700 | 4166.979833 | 3506.073128 | 4095.810057 | 3289.643995 | 2809.626088 | 1895.770869 | 23 |
5 rows × 63 columns
#first four rows of the csv file include information about the last update date and source
gini_df = pd.read_csv('world_bank_csv/gini_index.csv', skiprows = 4)
#drop unnecessary columns/information
gini_df = gini_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
#create new column tracking number of years of missing data per nation
gini_df['Missing Years'] = gini_df.isnull().sum(axis=1)
#print(gini_df.shape) #check dimensions; note from the head below that some nations have no Gini Index entries at all
gini_df.head()
| | Country Name | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Missing Years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 |
| 1 | Africa Eastern and Southern | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 |
| 2 | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 |
| 3 | Africa Western and Central | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 |
| 4 | Angola | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 51.3 | NaN | NaN | 58 |
5 rows × 63 columns
#plotting frequency histogram/distribution of gini coefficients by year
starting_year = 1960
for i in range(gini_df.shape[1] - 2):
    current_year = str(starting_year + i)
    gini_df[current_year].hist()
#plotting distribution of gdp per capita by year
starting_year = 1960
for i in range(gdp_df.shape[1] - 2):
    current_year = str(starting_year + i)
    gdp_df[current_year].hist()
This distribution is heavily right-skewed. To make it easier to work with, I will take its natural log, stored as a new set of columns in the GDP Per Capita dataframe.
starting_year = 1960
for i in range(gdp_df.shape[1] - 2):
    current_year = str(starting_year + i)
    new_column = "{}_log".format(starting_year + i)
    gdp_df[new_column] = np.log(gdp_df[current_year])
#plotting distribution of log gdp per capita by year
starting_year = 1960
for i in range(61):
    current_year = "{}_log".format(starting_year + i)
    gdp_df[current_year].hist()
I create a new dataframe containing the average world GDP Per Capita and Gini Index for each year the World Bank has tracked.
#try taking world average gdp_capita and world average gini_index for each year
#only save averages for years that have both an average gdp_capita and a gini_index
data_df = pd.DataFrame()
years = []
gdp_average = []
gini_average = []
gdp_entries = []
gini_entries = []
starting_year = 1960
for i in range(61):
    current_year = str(starting_year + i)
    valid_gdp = gdp_df[current_year].count()
    valid_gini = gini_df[current_year].count()
    gdp_sum = gdp_df[current_year].sum()
    gini_sum = gini_df[current_year].sum()
    #Series.sum() skips NaN, so it suffices to require at least one entry of each indicator
    if valid_gdp != 0 and valid_gini != 0:
        gdp_average.append(gdp_sum / valid_gdp)
        gini_average.append(gini_sum / valid_gini)
        years.append(current_year)
        gdp_entries.append(valid_gdp)
        gini_entries.append(valid_gini)
data_df['Years'] = years
data_df['GDP Per Capita Average'] = gdp_average
data_df['Gini Index Average'] = gini_average
data_df['Valid GDP'] = gdp_entries
data_df['Valid Gini'] = gini_entries
data_df
| | Years | GDP Per Capita Average | Gini Index Average | Valid GDP | Valid Gini |
|---|---|---|---|---|---|
| 0 | 1967 | 734.022875 | 34.000000 | 154 | 1 |
| 1 | 1969 | 818.051852 | 33.700000 | 158 | 1 |
| 2 | 1971 | 1087.372608 | 37.300000 | 168 | 1 |
| 3 | 1974 | 1987.849371 | 32.650000 | 169 | 2 |
| 4 | 1975 | 2316.430265 | 28.800000 | 172 | 2 |
| 5 | 1978 | 3108.821796 | 35.200000 | 175 | 1 |
| 6 | 1979 | 3698.378374 | 34.960000 | 176 | 5 |
| 7 | 1980 | 4239.627328 | 40.700000 | 188 | 3 |
| 8 | 1981 | 4077.894142 | 41.828571 | 191 | 7 |
| 9 | 1982 | 3893.909840 | 47.200000 | 192 | 2 |
| 10 | 1983 | 3712.542225 | 36.350000 | 193 | 4 |
| 11 | 1984 | 3656.726339 | 44.000000 | 194 | 6 |
| 12 | 1985 | 3643.754821 | 37.781818 | 196 | 11 |
| 13 | 1986 | 4205.815900 | 41.585714 | 198 | 14 |
| 14 | 1987 | 4801.455408 | 37.205000 | 202 | 20 |
| 15 | 1988 | 5159.863046 | 39.844444 | 207 | 9 |
| 16 | 1989 | 5294.697899 | 47.807692 | 207 | 13 |
| 17 | 1990 | 5843.407255 | 40.864706 | 222 | 17 |
| 18 | 1991 | 6039.396432 | 40.036842 | 217 | 19 |
| 19 | 1992 | 6286.610275 | 41.728125 | 220 | 32 |
| 20 | 1993 | 6006.127947 | 44.928571 | 225 | 21 |
| 21 | 1994 | 6269.150993 | 44.511538 | 229 | 26 |
| 22 | 1995 | 7090.469727 | 38.818919 | 239 | 37 |
| 23 | 1996 | 7309.985499 | 44.246875 | 239 | 32 |
| 24 | 1997 | 7183.809444 | 44.522222 | 239 | 27 |
| 25 | 1998 | 7397.620838 | 44.072973 | 241 | 37 |
| 26 | 1999 | 7682.781539 | 44.787097 | 243 | 31 |
| 27 | 2000 | 7576.442171 | 40.598000 | 248 | 50 |
| 28 | 2001 | 7514.594204 | 43.188889 | 248 | 36 |
| 29 | 2002 | 8092.683643 | 41.867925 | 253 | 53 |
| 30 | 2003 | 9279.537651 | 39.391525 | 253 | 59 |
| 31 | 2004 | 10557.975823 | 37.935714 | 253 | 70 |
| 32 | 2005 | 11417.672679 | 39.132895 | 253 | 76 |
| 33 | 2006 | 12739.834524 | 37.734667 | 254 | 75 |
| 34 | 2007 | 14502.181554 | 36.515714 | 254 | 70 |
| 35 | 2008 | 15627.290859 | 37.490411 | 253 | 73 |
| 36 | 2009 | 13936.923463 | 37.567949 | 253 | 78 |
| 37 | 2010 | 14611.471052 | 36.506024 | 254 | 83 |
| 38 | 2011 | 16112.808256 | 36.170130 | 256 | 77 |
| 39 | 2012 | 16046.845902 | 36.327381 | 255 | 84 |
| 40 | 2013 | 16591.592454 | 36.418421 | 256 | 76 |
| 41 | 2014 | 16902.682546 | 36.554217 | 256 | 83 |
| 42 | 2015 | 15325.939761 | 36.789024 | 255 | 82 |
| 43 | 2016 | 15495.810614 | 36.228395 | 254 | 81 |
| 44 | 2017 | 16281.485649 | 35.947222 | 254 | 72 |
| 45 | 2018 | 17258.333523 | 35.546875 | 254 | 64 |
| 46 | 2019 | 15954.548397 | 40.972727 | 247 | 22 |
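The same per-year averages can also be produced without an explicit loop by letting pandas aggregate over the year columns. A minimal sketch with tiny synthetic frames standing in for the World Bank data (the function and frame names here are illustrative, not from the cells above):

```python
import numpy as np
import pandas as pd

def yearly_averages(gdp_df, gini_df, years):
    # Series.mean()/count() skip NaN, so each year's average uses only
    # the nations that have a value recorded that year
    out = pd.DataFrame({
        'Years': years,
        'GDP Per Capita Average': gdp_df[years].mean().values,
        'Gini Index Average': gini_df[years].mean().values,
        'Valid GDP': gdp_df[years].count().values,
        'Valid Gini': gini_df[years].count().values,
    })
    # keep only years with at least one entry of each indicator
    return out[(out['Valid GDP'] > 0) & (out['Valid Gini'] > 0)].reset_index(drop=True)

# tiny synthetic wide-format frames in place of gdp_df / gini_df
gdp_wide = pd.DataFrame({'Country Name': ['A', 'B'],
                         '1990': [100.0, 300.0], '1991': [np.nan, 400.0]})
gini_wide = pd.DataFrame({'Country Name': ['A', 'B'],
                          '1990': [30.0, np.nan], '1991': [np.nan, np.nan]})
print(yearly_averages(gdp_wide, gini_wide, ['1990', '1991']))
```

Here 1991 is dropped because it has no Gini entries, mirroring the filtering in the loop above.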
#plot the scatterplot of these averages by year
data_df.plot.scatter(y = 'Gini Index Average', x = 'GDP Per Capita Average', c = 'b')
plt.show()
Looking at the graph, a polynomial of degree three or four seems appropriate. I split the data into training and test sets to check for overfitting.
from sklearn.model_selection import train_test_split
#polynomial regression
training_df, test_df = train_test_split(data_df, test_size = 0.20)
results_df = pd.DataFrame()
results_df['GDP Per Capita Average'] = training_df['GDP Per Capita Average']
results_df['Actual Gini'] = training_df['Gini Index Average']
coefficients = np.polyfit(training_df['GDP Per Capita Average'], training_df['Gini Index Average'], deg = 4)
#np.polyval evaluates the fitted polynomial; coefficients are ordered highest degree first
results_df['Predictions Gini'] = np.polyval(coefficients, training_df['GDP Per Capita Average'])
results_df
plt.scatter(results_df['GDP Per Capita Average'], results_df['Actual Gini'], c = 'b')
plt.scatter(results_df['GDP Per Capita Average'], results_df['Predictions Gini'], c = 'r')
plt.show()
#use the same degree-4 coefficients on the testing data, plot the residuals
#(np.polyval avoids the earlier bug of expanding the five coefficients as if they were a cubic)
test_results_df = pd.DataFrame()
test_results_df['Predictions'] = np.polyval(coefficients, test_df['GDP Per Capita Average'])
test_results_df['Actual Gini'] = test_df['Gini Index Average']
test_results_df['Residuals'] = test_results_df['Predictions'] - test_results_df['Actual Gini']
sns.regplot(x = test_results_df['Predictions'], y = test_results_df['Residuals'], ci = None)
plt.title("Residuals Plot")
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
Applying the polynomial fit from the training data to the test data, it becomes apparent that the residuals plot is not evenly distributed. At this point, I abandon the approach of using world averages by year. Not only are there too few entries to consider, but some of the earliest years in the table (1967, 1969) have only a single nation with both a recorded GDP Per Capita and Gini Index.
However, the existing data is difficult to work with: it is split across two dataframes and contains many missing values.
To solve this, I will iterate through the two dataframes year by year, building a new dataframe in which each row is a nation's GDP Per Capita and Gini Index for a specific year. Since I am interested in both indicators, I will skip nation-years where either value is missing.
#need to change structure of dataframe to better suit graphing
#make each nation-year Gini/GDP Per Capita pair its own row
new_df = pd.DataFrame()
year = []
country = []
gdp_per_capita = []
gini = []
gdp_log = []
countries = gdp_df['Country Name']
starting_year = 1960
for i in range(61):
    current_year = str(starting_year + i)
    gdp_values = gdp_df[current_year]
    gini_values = gini_df[current_year]
    for j in range(len(gdp_values)):
        #keep only nation-years where both indicators are present
        if not math.isnan(gdp_values[j]) and not math.isnan(gini_values[j]):
            year.append(current_year)
            country.append(countries[j])
            gini.append(gini_values[j])
            gdp_per_capita.append(gdp_values[j])
            gdp_log.append(np.log(gdp_values[j]))
new_df['Country'] = country
new_df['Year'] = year
new_df['GDP Per Capita'] = gdp_per_capita
new_df['Gini'] = gini
new_df['GDP_Log'] = gdp_log
new_df.head(20) #first 20 values
| | Country | Year | GDP Per Capita | Gini | GDP_Log |
|---|---|---|---|---|---|
| 0 | Sweden | 1967 | 3720.926845 | 34.0 | 8.221728 |
| 1 | United Kingdom | 1969 | 2100.667869 | 33.7 | 7.650011 |
| 2 | Canada | 1971 | 4520.162878 | 37.3 | 8.416303 |
| 3 | United Kingdom | 1974 | 3665.862798 | 30.0 | 8.206819 |
| 4 | United States | 1974 | 7225.691360 | 35.3 | 8.885398 |
| 5 | Canada | 1975 | 7511.211343 | 33.3 | 8.924152 |
| 6 | Sweden | 1975 | 10117.306684 | 24.3 | 9.222003 |
| 7 | France | 1978 | 9264.775293 | 35.2 | 9.133975 |
| 8 | United Kingdom | 1979 | 7804.762081 | 28.4 | 8.962489 |
| 9 | Norway | 1979 | 13046.537221 | 26.9 | 9.476278 |
| 10 | Panama | 1979 | 1918.162773 | 48.7 | 7.559123 |
| 11 | United States | 1979 | 11674.186310 | 34.5 | 9.365135 |
| 12 | Argentina | 1980 | 2758.834817 | 40.8 | 7.922564 |
| 13 | Spain | 1980 | 6208.578019 | 34.5 | 8.733687 |
| 14 | Madagascar | 1980 | 596.774979 | 46.8 | 6.391540 |
| 15 | Australia | 1981 | 11833.743212 | 31.3 | 9.378710 |
| 16 | Brazil | 1981 | 2132.883317 | 57.9 | 7.665230 |
| 17 | Canada | 1981 | 12337.466249 | 32.4 | 9.420396 |
| 18 | Costa Rica | 1981 | 1068.502425 | 47.5 | 6.974013 |
| 19 | Sweden | 1981 | 15586.430078 | 22.9 | 9.654156 |
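For reference, the same nation-year pairing can be built without the nested loop by reshaping both wide frames to long format and merging. A sketch with tiny synthetic frames standing in for `gdp_df` and `gini_df` (same wide layout as above; names here are illustrative):

```python
import numpy as np
import pandas as pd

def pair_indicators(gdp_df, gini_df):
    """Melt the wide year columns into long format and keep only the
    nation-years where both indicators are present."""
    def tidy(df, value_name):
        return df.melt(id_vars='Country Name', var_name='Year',
                       value_name=value_name)
    merged = tidy(gdp_df, 'GDP Per Capita').merge(
        tidy(gini_df, 'Gini'), on=['Country Name', 'Year']).dropna()
    merged['GDP_Log'] = np.log(merged['GDP Per Capita'])
    return merged.reset_index(drop=True)

# tiny synthetic frames standing in for the World Bank data
gdp_demo = pd.DataFrame({'Country Name': ['A', 'B'],
                         '1990': [1000.0, np.nan], '1991': [1100.0, 500.0]})
gini_demo = pd.DataFrame({'Country Name': ['A', 'B'],
                          '1990': [30.0, 40.0], '1991': [np.nan, 45.0]})
print(pair_indicators(gdp_demo, gini_demo))
```

`dropna()` after the merge plays the role of the `math.isnan` checks: only nation-years with both indicators survive.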
plt.scatter(new_df['GDP_Log'], new_df['Gini'], c = 'b')
plt.xlabel("Log GDP Per Capita")
plt.ylabel("Gini Index")
plt.title("World Nations' Gini Index and Log GDP Per Capita")
plt.show()
import plotly.express as px
#play with plotly
fig = px.scatter(new_df, x = 'GDP Per Capita', y = 'Gini', animation_frame = 'Year',
hover_name = 'Country', log_x = True, range_x = [10, 1000000], range_y = [20, 80])
fig.show()
import statsmodels.api as sm
import statsmodels.formula.api as smf
#simple linear regression of Gini on log GDP per capita
full_model = smf.ols("Gini ~ GDP_Log", data = new_df).fit()
print(full_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.155
Model: OLS Adj. R-squared: 0.154
Method: Least Squares F-statistic: 318.5
Date: Sun, 19 Sep 2021 Prob (F-statistic): 1.49e-65
Time: 23:06:58 Log-Likelihood: -6183.3
No. Observations: 1740 AIC: 1.237e+04
Df Residuals: 1738 BIC: 1.238e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 59.0584 1.168 50.582 0.000 56.768 61.348
GDP_Log -2.3771 0.133 -17.848 0.000 -2.638 -2.116
==============================================================================
Omnibus: 64.104 Durbin-Watson: 1.826
Prob(Omnibus): 0.000 Jarque-Bera (JB): 65.689
Skew: 0.447 Prob(JB): 5.44e-15
Kurtosis: 2.674 Cond. No. 51.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the residuals plot below, we see that the residuals are slightly unevenly distributed above and below the zero line: fitted values from roughly 32 to 37 sit far closer to the zero line than larger fitted values do. However, we will proceed with caution.
The residuals are relatively evenly spread from left to right. We meet the condition of Constant Variability of Residuals.
sns.regplot(x = full_model.fittedvalues, y = full_model.resid, ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
We see that the residuals are normally distributed. We meet the condition of Normality of Residuals.
plt.hist(full_model.resid)
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()
The observations are not independent: a nation's GDP Per Capita and Gini Index in one year are heavily influenced by its values in previous years. However, we will still proceed with caution, as we are interested in the relationship between the two values by year and by nation.
With an R-squared of 0.155, there is only a weak linear relationship between a nation's Log GDP Per Capita and its Gini Index. Since Log GDP Per Capita has a negative coefficient, that weak relationship is negative.
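As a sanity check on this reading: with a single predictor, OLS R-squared is just the squared Pearson correlation, so the sign and strength of the fit can be read off the correlation directly. A small sketch with synthetic data in place of `new_df` (the slope and noise level here are chosen only to roughly mimic the fit above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(8, 1, size=1000)                    # stand-in for GDP_Log
y = 59.1 - 2.38 * x + 5.5 * rng.normal(size=1000)  # weak negative relationship
r = np.corrcoef(x, y)[0, 1]
# for simple regression, R^2 equals r^2, so a weak fit means |r| well below 1
print(round(r, 3), round(r * r, 3))
```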
Even if GDP Per Capita is a relatively weak predictor for Gini Index, would adding more indicators in a multiple variable linear regression lead to a stronger relationship?
I decided to add Adult Literacy Rate (as % of those 15 and above), Birth Rate (per 1000 people), Electricity Access (as % of population), Female Labor (as % of total labor force), Government Debt, Inflation (annual %), Life Expectancy (years), and Population as additional indicators.
#add more information to the dataframe
#take in a bunch of different world bank indicators
#helper to load a World Bank indicator csv and drop the unused columns
def load_indicator(path):
    df = pd.read_csv(path, skiprows = 4)
    return df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])

literacy_df = load_indicator('world_bank_csv/adult_literacy.csv')
birth_df = load_indicator('world_bank_csv/birth_rate.csv')
electricity_df = load_indicator('world_bank_csv/electricity.csv')
female_labor_df = load_indicator('world_bank_csv/female_labor_force.csv')
fuel_df = load_indicator('world_bank_csv/fuel_exports.csv')
govdebt_df = load_indicator('world_bank_csv/government_debt.csv')
inflation_df = load_indicator('world_bank_csv/inflation.csv')
lifeexpect_df = load_indicator('world_bank_csv/life_expectancy.csv')
pop_df = load_indicator('world_bank_csv/population.csv')
cleaned_df = pd.DataFrame()
year = []
country = []
gdp_per_capita = []
gini = []
literacy = []
birth = []
electricity = []
female = []
fuel = []
debt = []
inflation = []
life = []
pop = []
countries = gdp_df['Country Name']
#build one row per nation-year with every indicator present
starting_year = 1960
for i in range(61):
    current_year = str(starting_year + i)
    gdp_values = gdp_df[current_year]
    gini_values = gini_df[current_year]
    literacy_values = literacy_df[current_year]
    birth_values = birth_df[current_year]
    elec_values = electricity_df[current_year]
    fem_values = female_labor_df[current_year]
    fuel_values = fuel_df[current_year]
    debt_values = govdebt_df[current_year]
    infl_values = inflation_df[current_year]
    life_values = lifeexpect_df[current_year]
    pop_values = pop_df[current_year]
    for j in range(len(gdp_values)):
        #keep only nation-years where no indicator is missing
        if not (math.isnan(gdp_values[j]) or math.isnan(gini_values[j])
                or math.isnan(literacy_values[j]) or math.isnan(birth_values[j])
                or math.isnan(elec_values[j]) or math.isnan(fem_values[j])
                or math.isnan(fuel_values[j]) or math.isnan(debt_values[j])
                or math.isnan(infl_values[j]) or math.isnan(life_values[j])
                or math.isnan(pop_values[j])):
            year.append(current_year)
            country.append(countries[j])
            gdp_per_capita.append(gdp_values[j])
            gini.append(gini_values[j])
            literacy.append(literacy_values[j])
            birth.append(birth_values[j])
            electricity.append(elec_values[j])
            female.append(fem_values[j])
            fuel.append(fuel_values[j])
            debt.append(debt_values[j])
            inflation.append(infl_values[j])
            life.append(life_values[j])
            pop.append(pop_values[j])
cleaned_df['Country'] = country
cleaned_df['Year'] = year
cleaned_df['GDP_Per_Capita'] = np.log(gdp_per_capita) #take log due to disparity between nations and sheer size
cleaned_df['Gini'] = gini
cleaned_df['Literacy_Rate'] = literacy
cleaned_df['Birth_Rate'] = birth
cleaned_df['Electricity_Access'] = electricity
cleaned_df['Female_Workforce_Percentage'] = female
cleaned_df['Fuel_Export_Percentage'] = fuel
cleaned_df['Central_Government_Debt'] = debt
cleaned_df['Inflation_Rate'] = inflation
cleaned_df['Birth_Life_Expectancy'] = life
cleaned_df['Population'] = np.log(pop) #take log due to disparity between nations and sheer size
cleaned_df
| | Country | Year | GDP_Per_Capita | Gini | Literacy_Rate | Birth_Rate | Electricity_Access | Female_Workforce_Percentage | Fuel_Export_Percentage | Central_Government_Debt | Inflation_Rate | Birth_Life_Expectancy | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | South Africa | 1996 | 8.158924 | 60.7 | 82.402100 | 24.572 | 57.600000 | 39.864593 | 10.854996 | 48.912956 | 7.905679 | 60.595000 | 17.558902 |
| 1 | Pakistan | 1998 | 6.133868 | 33.1 | 42.699310 | 36.168 | 70.460000 | 14.530833 | 0.317590 | 79.084692 | 7.526037 | 62.303000 | 18.719623 |
| 2 | Belarus | 1999 | 7.100065 | 31.7 | 99.590729 | 9.300 | 100.000000 | 48.653817 | 9.063458 | 15.103117 | 316.793435 | 67.907317 | 16.120766 |
| 3 | Jamaica | 1999 | 8.124351 | 44.1 | 79.920120 | 21.908 | 84.554112 | 44.340099 | 0.295989 | 88.963644 | 6.916061 | 74.152000 | 14.783512 |
| 4 | Costa Rica | 2000 | 8.239872 | 47.4 | 94.868187 | 19.335 | 96.940000 | 32.631569 | 0.616779 | 38.501849 | 9.389714 | 77.452000 | 15.192353 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | Indonesia | 2016 | 8.178307 | 38.6 | 95.376968 | 18.790 | 97.620000 | 38.344075 | 19.289033 | 31.366191 | 2.438924 | 71.035000 | 19.382160 |
| 100 | Peru | 2016 | 8.733110 | 43.6 | 94.173668 | 18.323 | 94.200000 | 45.834549 | 6.414648 | 23.183250 | 3.080883 | 76.044000 | 17.247109 |
| 101 | El Salvador | 2016 | 8.244333 | 40.0 | 88.141769 | 18.587 | 96.000000 | 42.202507 | 2.857031 | 52.065656 | 0.655309 | 72.644000 | 15.664931 |
| 102 | Turkey | 2016 | 9.296023 | 41.9 | 96.167328 | 16.494 | 100.000000 | 32.256343 | 2.119850 | 31.653346 | 8.130478 | 76.860000 | 18.195383 |
| 103 | Uruguay | 2016 | 9.724100 | 39.7 | 98.561470 | 14.057 | 99.700000 | 45.411290 | 0.482251 | 46.656630 | 16.613684 | 77.498000 | 15.046361 |
104 rows × 13 columns
#try to train linear model using all of the variables predicting gini coefficient
#create training and testing dataset to train and test a linear model
training_df, test_df = train_test_split(cleaned_df, test_size = 0.15)
import statsmodels.api as sm
import statsmodels.formula.api as smf
#start with a linear regression with all of the variables
full_model = smf.ols("Gini ~ GDP_Per_Capita + Literacy_Rate + Birth_Rate + Electricity_Access + Female_Workforce_Percentage + Fuel_Export_Percentage + Central_Government_Debt + Inflation_Rate + Birth_Life_Expectancy + Population", data = training_df).fit()
print(full_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.388
Model: OLS Adj. R-squared: 0.309
Method: Least Squares F-statistic: 4.887
Date: Sun, 19 Sep 2021 Prob (F-statistic): 1.91e-05
Time: 23:06:59 Log-Likelihood: -284.58
No. Observations: 88 AIC: 591.2
Df Residuals: 77 BIC: 618.4
Df Model: 10
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept -34.9606 27.056 -1.292 0.200 -88.836 18.915
GDP_Per_Capita 1.8324 1.429 1.282 0.204 -1.014 4.678
Literacy_Rate 0.0468 0.122 0.383 0.703 -0.196 0.290
Birth_Rate 0.9905 0.243 4.079 0.000 0.507 1.474
Electricity_Access 0.0342 0.144 0.238 0.813 -0.252 0.320
Female_Workforce_Percentage 0.4781 0.155 3.078 0.003 0.169 0.787
Fuel_Export_Percentage -0.0092 0.049 -0.186 0.853 -0.107 0.089
Central_Government_Debt 0.1045 0.038 2.744 0.008 0.029 0.180
Inflation_Rate -0.0826 0.151 -0.546 0.587 -0.384 0.219
Birth_Life_Expectancy -0.1767 0.373 -0.474 0.637 -0.919 0.566
Population 1.5265 0.525 2.905 0.005 0.480 2.573
==============================================================================
Omnibus: 4.210 Durbin-Watson: 2.184
Prob(Omnibus): 0.122 Jarque-Bera (JB): 3.683
Skew: -0.494 Prob(JB): 0.159
Kurtosis: 3.166 Cond. No. 6.47e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.47e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
From the residuals plot below, we see that the residuals are relatively evenly distributed above and below the zero line. We meet the condition of Linearity.
The residuals are relatively evenly distributed from left to right. We meet the condition of Constant Variability of Residuals.
sns.regplot(x = full_model.fittedvalues, y = full_model.resid, ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
We see that the residuals are relatively normally distributed (slight left skew). We proceed with caution.
plt.hist(full_model.resid)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
Since we are now working with more than one independent variable, we want to ensure that there are no strong associations among the independent variables themselves. From the variance inflation factor table calculated below, many variables appear to have linear relationships with each other.
GDP Per Capita, Literacy Rate, Electricity Access, Female Workforce Percentage, Life Expectancy, and Population all have very high VIFs. This makes sense, as wealthier nations typically have higher literacy rates, electricity access, female labor participation, and life expectancy (and lower birth rates).
I did a poor job of selecting variables. There is multicollinearity!
from statsmodels.stats.outliers_influence import variance_inflation_factor
#calculate variance inflation factor for each variable to determine multicollinearity
variables = training_df[['GDP_Per_Capita', 'Literacy_Rate', 'Birth_Rate', 'Electricity_Access', 'Female_Workforce_Percentage', 'Fuel_Export_Percentage', 'Central_Government_Debt', 'Inflation_Rate', 'Birth_Life_Expectancy', 'Population']]
multicolin_df = pd.DataFrame()
multicolin_df['Variables'] = variables.columns
multicolin_df['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(len(variables.columns))]
print(multicolin_df)
                     Variables          VIF
0               GDP_Per_Capita   275.859090
1                Literacy_Rate   260.857715
2                   Birth_Rate    17.897844
3           Electricity_Access   337.836103
4  Female_Workforce_Percentage    56.430721
5       Fuel_Export_Percentage     1.750255
6      Central_Government_Debt     8.310016
7               Inflation_Rate     3.117016
8        Birth_Life_Expectancy  1040.140881
9                   Population   128.488199
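For intuition, the VIF of a variable is 1 / (1 − R²) from regressing that variable on all the others; values far above roughly 5–10 flag collinearity. One caveat: statsmodels' `variance_inflation_factor` uses the design matrix exactly as given, so without an appended constant column (as in the cell above) the underlying regressions have no intercept and the reported values tend to come out much larger. A NumPy-only sketch of the centered (with-intercept) version:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the
    remaining columns, with an intercept included."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # constant column first
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()  # residuals are mean-zero with an intercept
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)             # independent of the others
X = np.column_stack([x1, x2, x3])
print([round(vif(X, i), 1) for i in range(3)])  # x1 and x2 huge, x3 near 1
```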
reduced_model = smf.ols("Gini ~ GDP_Per_Capita + Fuel_Export_Percentage + Inflation_Rate", data = training_df).fit()
print(reduced_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.041
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 1.191
Date: Sun, 19 Sep 2021 Prob (F-statistic): 0.318
Time: 23:06:59 Log-Likelihood: -304.37
No. Observations: 88 AIC: 616.7
Df Residuals: 84 BIC: 626.6
Df Model: 3
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 46.6658 8.048 5.798 0.000 30.662 62.670
GDP_Per_Capita -0.5175 0.896 -0.578 0.565 -2.298 1.263
Fuel_Export_Percentage 0.0823 0.054 1.534 0.129 -0.024 0.189
Inflation_Rate -0.2059 0.160 -1.283 0.203 -0.525 0.113
==============================================================================
Omnibus: 1.680 Durbin-Watson: 1.889
Prob(Omnibus): 0.432 Jarque-Bera (JB): 1.496
Skew: 0.185 Prob(JB): 0.473
Kurtosis: 2.479 Cond. No. 197.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Checking VIF again with the reduced model, we see that all values are relatively low. We meet the no-multicollinearity condition. We proceed with caution regarding the independence-of-observations condition for the same reason as before.
#calculate variance inflation factor to check multicollinearity among the remaining variables
variables = training_df[['GDP_Per_Capita', 'Fuel_Export_Percentage', 'Inflation_Rate']]
multicolin_df = pd.DataFrame()
multicolin_df['Variables'] = variables.columns
multicolin_df['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(len(variables.columns))]
multicolin_df
| | Variables | VIF |
|---|---|---|
| 0 | GDP_Per_Capita | 2.489639 |
| 1 | Fuel_Export_Percentage | 1.488698 |
| 2 | Inflation_Rate | 2.277685 |
With an R-squared of 0.041, adding the two variables of Fuel Export Percentage and Inflation Rate produced a linear model that fits even worse than Log GDP Per Capita alone.
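Since the two models have different numbers of predictors, the fairer comparison is adjusted R-squared, which penalizes each added term. A quick check against the Adj. R-squared lines reported in the two summaries above (n = 88 observations in both fits):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.388, 88, 10), 3))  # full model: ~0.309
print(round(adjusted_r2(0.041, 88, 3), 3))   # reduced model: ~0.007
```

Both values match the summaries, and even after penalizing its extra terms the full model still explains far more variance than the reduced one.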
What if I do a better job of selecting indicators to examine?
I will look at the economic makeup of a nation, considering indicators like Goods and Services Imports (as % of GDP), Goods and Services Exports (as % of GDP), High Tech Exports (as % of manufactured exports), International Tourism Expenditures (as % of total imports), Ore/Metal Exports (as % of merchandise exports), and Fuel Exports (as % of merchandise exports).
The reasoning is that a nation's exports typically provide a good snapshot of its economy. Reading in the news about states like the UAE that are known for stark income disparities, I wonder whether economies that rely on natural resource exports (like fuel and ore/metal) are more likely to have stark economic disparities. The logic behind that hypothesis is that it is easier for a few individuals to control oil production, and harder for new players to secure the infrastructure and rights to compete.
gsi_df = pd.read_csv('world_bank_csv/goods_and_services_imports.csv', skiprows = 4)
gsi_df = gsi_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
gse_df = pd.read_csv('world_bank_csv/goods_and_services_exports.csv', skiprows = 4)
gse_df = gse_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
teche_df = pd.read_csv('world_bank_csv/high_tech_exports.csv', skiprows = 4)
teche_df = teche_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
tourism_df = pd.read_csv('world_bank_csv/international_tourism_expenditures.csv', skiprows = 4)
tourism_df = tourism_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
ore_metale_df = pd.read_csv('world_bank_csv/ore_metal_export.csv', skiprows = 4)
ore_metale_df = ore_metale_df.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 65'])
exports_df = pd.DataFrame()
year = []
country = []
gdp_per_capita = []
gini = []
gsi = []
gse = []
teche = []
tourism = []
ore_metale = []
fuel = []
inflation = []
countries = gdp_df['Country Name']
#iterate over each year column (1960 through 2020) and keep rows where every indicator is present
for current_year in [str(y) for y in range(1960, 2021)]:
    gdp_values = gdp_df[current_year]
    gini_values = gini_df[current_year]
    gsi_values = gsi_df[current_year]
    gse_values = gse_df[current_year]
    teche_values = teche_df[current_year]
    tourism_values = tourism_df[current_year]
    fuel_values = fuel_df[current_year]
    ore_metale_values = ore_metale_df[current_year]
    infl_values = inflation_df[current_year]
    for j in range(len(gdp_values)):
        row = [gdp_values[j], gini_values[j], gsi_values[j], gse_values[j], teche_values[j],
               tourism_values[j], fuel_values[j], ore_metale_values[j], infl_values[j]]
        if not any(math.isnan(v) for v in row):
            year.append(current_year)
            country.append(countries[j])
            gdp_per_capita.append(gdp_values[j])
            gini.append(gini_values[j])
            gsi.append(gsi_values[j])
            gse.append(gse_values[j])
            teche.append(teche_values[j])
            tourism.append(tourism_values[j])
            fuel.append(fuel_values[j])
            ore_metale.append(ore_metale_values[j])
            inflation.append(infl_values[j])
exports_df['Country'] = country
exports_df['Year'] = year
exports_df['GDP_Per_Capita'] = np.log(gdp_per_capita) #log-transform: GDP per capita spans several orders of magnitude across nations
exports_df['Gini'] = gini
exports_df['Goods_Services_Import'] = gsi
exports_df['Goods_Services_Export'] = gse
exports_df['Technology_Export'] = teche
exports_df['Tourism_Expenditure'] = tourism
exports_df['Ore_Metal_Export'] = ore_metale
exports_df['Fuel_Export_Percentage'] = fuel
exports_df['Inflation_Rate'] = inflation
exports_df
| | Country | Year | GDP_Per_Capita | Gini | Goods_Services_Import | Goods_Services_Export | Technology_Export | Tourism_Expenditure | Ore_Metal_Export | Fuel_Export_Percentage | Inflation_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Argentina | 2007 | 8.888129 | 46.2 | 18.282420 | 22.662750 | 7.038996 | 9.480555 | 3.815607 | 10.926309 | 14.939925 |
| 1 | Armenia | 2007 | 8.051749 | 31.2 | 38.785690 | 18.189357 | 1.139896 | 11.095929 | 31.456524 | 1.196460 | 4.277876 |
| 2 | Belgium | 2007 | 10.697902 | 29.2 | 74.166642 | 78.301805 | 8.480224 | 5.406683 | 3.653725 | 7.081016 | 1.931721 |
| 3 | Bulgaria | 2007 | 8.680180 | 36.1 | 71.221940 | 52.387706 | 6.328294 | 6.077765 | 17.770680 | 14.603772 | 11.083523 |
| 4 | Bolivia | 2007 | 7.224483 | 54.5 | 34.266255 | 41.795657 | 4.675422 | 9.299056 | 26.164308 | 47.924153 | 7.373355 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 696 | Paraguay | 2019 | 8.590623 | 45.7 | 35.162269 | 36.199998 | 7.182844 | 4.198890 | 0.916135 | 20.502605 | 3.011893 |
| 697 | El Salvador | 2019 | 8.335127 | 38.8 | 46.300953 | 29.675841 | 5.126159 | 4.312070 | 1.070888 | 3.373800 | 0.711895 |
| 698 | Thailand | 2019 | 8.964058 | 34.9 | 50.144152 | 59.487486 | 23.610647 | 6.176720 | 1.602224 | 3.633637 | 0.946249 |
| 699 | Turkey | 2019 | 9.118948 | 41.9 | 29.941459 | 32.741404 | 3.043568 | 2.158029 | 3.719998 | 4.494087 | 13.906207 |
| 700 | Uruguay | 2019 | 9.780643 | 39.7 | 21.856853 | 27.750190 | 8.271267 | 11.118857 | 0.325450 | 1.189791 | 8.522967 |
701 rows × 11 columns
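As an aside, the year-by-year loop above could also be replaced by reshaping each wide indicator table into long form with `pd.melt` and merging on country and year. A minimal sketch on toy data (the frames here are made up, not the World Bank files):

```python
import pandas as pd

#toy stand-ins for two wide indicator tables (one column per year)
gdp = pd.DataFrame({'Country Name': ['A', 'B'],
                    '2018': [1000.0, None], '2019': [1100.0, 2000.0]})
gini = pd.DataFrame({'Country Name': ['A', 'B'],
                     '2018': [30.0, 35.0], '2019': [31.0, None]})

def to_long(df, value_name):
    #wide (one column per year) -> long (Country Name, Year, value)
    return df.melt(id_vars='Country Name', var_name='Year', value_name=value_name)

merged = to_long(gdp, 'GDP_Per_Capita').merge(to_long(gini, 'Gini'),
                                              on=['Country Name', 'Year'])
merged = merged.dropna().reset_index(drop=True)  #keep only rows with every indicator present
print(merged)
```

With more indicators, the merges can be chained the same way, and a single `dropna()` at the end replaces the long chain of `math.isnan` checks.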
This time around, I will check for multicollinearity to decide which variables to remove before proceeding with the regression.
We see below that Goods and Services Imports and Goods and Services Exports have relatively high VIFs in the 20s. That makes sense, since imports and exports move together as measures of trade openness. In my reduced model, I will remove Imports and only consider Exports.
#create a training dataset and a testing dataset to avoid overfitting and gauge effectiveness of model
training_df, test_df = train_test_split(exports_df, test_size = 0.15)
#calculate variance inflation factor for each variable to determine multicollinearity
variables = training_df[['GDP_Per_Capita', 'Goods_Services_Import', 'Goods_Services_Export', 'Technology_Export', 'Ore_Metal_Export', 'Fuel_Export_Percentage', 'Inflation_Rate', 'Tourism_Expenditure']]
multicolin_df = pd.DataFrame()
multicolin_df['Variables'] = variables.columns
multicolin_df['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(len(variables.columns))]
multicolin_df
| | Variables | VIF |
|---|---|---|
| 0 | GDP_Per_Capita | 12.805887 |
| 1 | Goods_Services_Import | 26.998400 |
| 2 | Goods_Services_Export | 23.837788 |
| 3 | Technology_Export | 3.124179 |
| 4 | Ore_Metal_Export | 1.492873 |
| 5 | Fuel_Export_Percentage | 1.633653 |
| 6 | Inflation_Rate | 1.502022 |
| 7 | Tourism_Expenditure | 4.817011 |
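For reference, the VIF for variable i is 1/(1 − R²ᵢ), where R²ᵢ comes from regressing variable i on the remaining variables. A minimal sketch on synthetic data (note that statsmodels' `variance_inflation_factor` regresses on the design matrix exactly as given, so its values differ slightly from this intercept-including version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_manual(X):
    #VIF_i = 1 / (1 - R2_i), with R2_i from regressing column i on the remaining columns
    vifs = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, y).score(others, y)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  #nearly collinear with a
c = rng.normal(size=200)                 #independent of both
X = np.column_stack([a, b, c])
print(vif_manual(X))  #a and b get large VIFs; c stays near 1
```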
With Goods_Services_Import removed, the VIF values below drop substantially; aside from GDP_Per_Capita, every variable sits comfortably under 5, so I will treat the no-multicollinearity condition as reasonably met.
#Goods_Services_Import and Goods_Services_Export have high VIF
#Remove Goods_Services_Import
variables = training_df[['GDP_Per_Capita', 'Goods_Services_Export', 'Technology_Export', 'Ore_Metal_Export', 'Fuel_Export_Percentage', 'Inflation_Rate', 'Tourism_Expenditure']]
multicolin_df = pd.DataFrame()
multicolin_df['Variables'] = variables.columns
multicolin_df['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(len(variables.columns))]
multicolin_df
| | Variables | VIF |
|---|---|---|
| 0 | GDP_Per_Capita | 12.026371 |
| 1 | Goods_Services_Export | 4.694688 |
| 2 | Technology_Export | 2.928452 |
| 3 | Ore_Metal_Export | 1.482258 |
| 4 | Fuel_Export_Percentage | 1.534152 |
| 5 | Inflation_Rate | 1.491209 |
| 6 | Tourism_Expenditure | 4.641619 |
#perform multiple variable linear regression (Inflation_Rate is left out of this model)
exports_model = smf.ols("Gini ~ GDP_Per_Capita + Goods_Services_Export + Technology_Export + Ore_Metal_Export + Fuel_Export_Percentage + Tourism_Expenditure", data = training_df).fit()
print(exports_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.256
Model: OLS Adj. R-squared: 0.248
Method: Least Squares F-statistic: 33.69
Date: Sun, 19 Sep 2021 Prob (F-statistic): 5.50e-35
Time: 23:06:59 Log-Likelihood: -2004.0
No. Observations: 595 AIC: 4022.
Df Residuals: 588 BIC: 4053.
Df Model: 6
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 57.4509 2.466 23.293 0.000 52.607 62.295
GDP_Per_Capita -1.5757 0.285 -5.530 0.000 -2.135 -1.016
Goods_Services_Export -0.0973 0.012 -8.237 0.000 -0.120 -0.074
Technology_Export 0.0269 0.032 0.836 0.404 -0.036 0.090
Ore_Metal_Export 0.0865 0.026 3.329 0.001 0.035 0.138
Fuel_Export_Percentage 0.0048 0.017 0.287 0.774 -0.028 0.038
Tourism_Expenditure -0.4858 0.093 -5.216 0.000 -0.669 -0.303
==============================================================================
Omnibus: 24.603 Durbin-Watson: 2.036
Prob(Omnibus): 0.000 Jarque-Bera (JB): 26.804
Skew: 0.519 Prob(JB): 1.51e-06
Kurtosis: 3.057 Cond. No. 483.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We see that the residuals are relatively evenly distributed above and below the zero line. Ignoring what seem to be outliers at fitted values roughly below 25 and above 48, we can proceed with caution.
The residuals are also spread relatively evenly from left to right, forming a roughly circular cloud with no obvious funnel shape, so we will proceed with caution.
sns.regplot(x = exports_model.fittedvalues, y = exports_model.resid, ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
The histogram of residuals seems relatively normal. We meet the condition of Normality of Residuals.
plt.hist(exports_model.resid)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
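A more formal complement to eyeballing the histogram would be a normality test such as Shapiro-Wilk. A sketch on synthetic residuals (standing in for `exports_model.resid`):

```python
import numpy as np
from scipy import stats

#synthetic residuals standing in for exports_model.resid
rng = np.random.default_rng(42)
resid = rng.normal(loc=0, scale=5, size=500)

stat, p = stats.shapiro(resid)
print("Shapiro-Wilk W = {:.3f}, p = {:.3f}".format(stat, p))
#a large p-value means no evidence against normality of the residuals
```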
Now we use the same model from the training data on the test data to check for overfitting.
predictions_df = pd.DataFrame(exports_model.predict(test_df))
results_df = pd.DataFrame()
results_df['predictions'] = predictions_df[0]
results_df['residuals'] = results_df['predictions'] - test_df['Gini']
results_df
| predictions | residuals | |
|---|---|---|
| 682 | 36.981301 | -4.418699 |
| 340 | 33.571984 | 7.071984 |
| 132 | 40.038967 | 5.538967 |
| 427 | 49.703907 | 15.403907 |
| 117 | 39.951654 | 11.751654 |
| ... | ... | ... |
| 372 | 34.556963 | -6.343037 |
| 118 | 38.029680 | 8.129680 |
| 267 | 31.164197 | 2.164197 |
| 485 | 38.141450 | 11.141450 |
| 302 | 19.886958 | -14.413042 |
106 rows × 2 columns
sns.regplot(x = results_df['predictions'], y = results_df['residuals'], ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
plt.hist(results_df['residuals'])
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
The residuals of the model on the test data seem relatively evenly distributed and normal.
With this, we see that using GDP Per Capita, Goods and Services Exports, Technology Exports, Ore/Metal Exports, Fuel Export Percentage, and Tourism Expenditure to create a multiple variable linear regression model for the Gini Index yields a relatively weak linear relationship, with an R-squared of about 0.26 on the training data. (Stronger than with just GDP Per Capita!)
The model provides interesting insights.
From the model we see that Technology Exports and Fuel Exports have high p-values, which may suggest that these two variables are not that important in determining the Gini Index. On the other hand, all of the other variables have very low p-values.
GDP Per Capita, Goods and Services Exports, and Tourism Expenditures have negative coefficients. This may suggest that nations with higher levels of those three variables have lower levels of income inequality.
Technology, Ore/Metal, and Fuel Exports all have positive coefficients. This may suggest that nations with higher levels of these three variables have higher levels of income inequality.
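To make the coefficients concrete, here is the fitted equation applied by hand to a hypothetical nation (the input values are invented for illustration; the coefficients are copied from the summary above):

```python
#coefficients copied from the regression summary
coefs = {
    'Intercept': 57.4509,
    'GDP_Per_Capita': -1.5757,
    'Goods_Services_Export': -0.0973,
    'Technology_Export': 0.0269,
    'Ore_Metal_Export': 0.0865,
    'Fuel_Export_Percentage': 0.0048,
    'Tourism_Expenditure': -0.4858,
}

#hypothetical nation: log GDP per capita of 9 (~$8100), exports at 30% of GDP, etc.
x = {'GDP_Per_Capita': 9.0, 'Goods_Services_Export': 30.0, 'Technology_Export': 5.0,
     'Ore_Metal_Export': 10.0, 'Fuel_Export_Percentage': 20.0, 'Tourism_Expenditure': 6.0}

predicted_gini = coefs['Intercept'] + sum(coefs[k] * v for k, v in x.items())
print(round(predicted_gini, 2))  #38.53
```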
print(exports_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.256
Model: OLS Adj. R-squared: 0.248
Method: Least Squares F-statistic: 33.69
Date: Sun, 19 Sep 2021 Prob (F-statistic): 5.50e-35
Time: 23:07:00 Log-Likelihood: -2004.0
No. Observations: 595 AIC: 4022.
Df Residuals: 588 BIC: 4053.
Df Model: 6
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 57.4509 2.466 23.293 0.000 52.607 62.295
GDP_Per_Capita -1.5757 0.285 -5.530 0.000 -2.135 -1.016
Goods_Services_Export -0.0973 0.012 -8.237 0.000 -0.120 -0.074
Technology_Export 0.0269 0.032 0.836 0.404 -0.036 0.090
Ore_Metal_Export 0.0865 0.026 3.329 0.001 0.035 0.138
Fuel_Export_Percentage 0.0048 0.017 0.287 0.774 -0.028 0.038
Tourism_Expenditure -0.4858 0.093 -5.216 0.000 -0.669 -0.303
==============================================================================
Omnibus: 24.603 Durbin-Watson: 2.036
Prob(Omnibus): 0.000 Jarque-Bera (JB): 26.804
Skew: 0.519 Prob(JB): 1.51e-06
Kurtosis: 3.057 Cond. No. 483.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Maybe the shape of the relationship is polynomial rather than linear?
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
#try to see if there is a polynomial relationship using sklearn
data = training_df.drop(columns = ['Country', 'Year', 'Gini'])
#try fitting with degree 2
degree_two_poly = PolynomialFeatures(degree = 2)
#transform the training data to be used in polynomial regression
transformed_training_df = degree_two_poly.fit_transform(data)
#create the polynomial regression model
regression = linear_model.LinearRegression()
regression.fit(transformed_training_df, training_df['Gini'])
results_df = pd.DataFrame()
results_df['Actual Gini'] = training_df['Gini']
results_df['Predictions'] = regression.predict(transformed_training_df)
results_df['Residuals'] = results_df['Predictions'] - results_df['Actual Gini']
sns.regplot(x = results_df['Predictions'], y = results_df['Residuals'], ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
results_df
| | Actual Gini | Predictions | Residuals |
|---|---|---|---|
| 396 | 52.6 | 48.568664 | -4.031336 |
| 288 | 39.0 | 40.357947 | 1.357947 |
| 380 | 24.6 | 38.604153 | 14.004153 |
| 56 | 25.2 | 27.801933 | 2.601933 |
| 238 | 27.8 | 28.134458 | 0.334458 |
| ... | ... | ... | ... |
| 107 | 48.9 | 40.556215 | -8.343785 |
| 65 | 27.5 | 33.919165 | 6.419165 |
| 585 | 49.7 | 45.964419 | -3.735581 |
| 643 | 28.2 | 26.890325 | -1.309675 |
| 28 | 34.0 | 36.481778 | 2.481778 |
595 rows × 3 columns
Using the same model we created with the training data, we predict values within the test data to check for overfitting. The result is a relatively evenly distributed residuals plot.
#use the same model to predict the test data (to gauge overfitting)
test = test_df.drop(columns = ['Country', 'Year', 'Gini'])
#transform (not fit_transform), so the test data reuses the feature expansion fitted on the training data
transformed_test_df = degree_two_poly.transform(test)
print(test)
print("Test Data Shape:", transformed_test_df.shape)
test_results_df = pd.DataFrame()
test_results_df['Actual Gini'] = test_df['Gini']
test_results_df['Predictions'] = regression.predict(transformed_test_df)
test_results_df['Residuals'] = test_results_df['Predictions'] - test_results_df['Actual Gini']
sns.regplot(x = test_results_df['Predictions'], y = test_results_df['Residuals'], ci = None)
plt.xlabel("Fitted Value")
plt.ylabel("Residual")
plt.show()
GDP_Per_Capita Goods_Services_Import Goods_Services_Export \
682 11.051912 15.225040 12.268198
340 9.910124 70.364035 76.058382
132 7.692470 61.135090 14.016103
427 6.336112 29.795738 15.945863
117 8.876994 33.927645 41.838422
.. ... ... ...
372 9.678758 20.441813 25.845337
118 6.769900 78.680806 54.698347
267 8.354116 47.572687 28.937487
485 7.912957 57.435227 31.895451
302 11.578236 155.417715 186.444307
Technology_Export Tourism_Expenditure Ore_Metal_Export \
682 18.741741 5.531621 2.957412
340 17.338109 3.156099 2.178478
132 0.130350 9.690088 8.963339
427 60.299853 3.985759 45.905542
117 31.686887 3.381947 11.703007
.. ... ... ...
372 10.732815 12.669831 4.575398
118 5.044538 7.668563 1.627652
267 0.926807 23.477814 10.911833
485 4.617076 8.259150 1.727114
302 8.680408 3.863811 5.291687
Fuel_Export_Percentage Inflation_Rate
682 13.529994 2.400981
340 2.957814 1.364705
132 0.215936 1.852733
427 27.031870 -0.405677
117 69.514157 4.691085
.. ... ...
372 70.559481 5.320139
118 3.400654 4.035304
267 26.588665 1.042425
485 0.132511 9.564366
302 0.906016 2.556498
[106 rows x 8 columns]
Test Data Shape: (106, 45)
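A tidier pattern for the polynomial fit would be a scikit-learn `Pipeline`, which guarantees the `PolynomialFeatures` step is fit on the training split only and merely applied to the test split. A sketch on synthetic data with a genuine quadratic signal (not the notebook's data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#synthetic data: the target depends quadratically on the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)  #PolynomialFeatures is fit on the training split only
print(round(model.score(X_test, y_test), 3))  #out-of-sample R^2, close to 1 here
```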